Hashing

\boxed{ \text{In a Nutshell: } \text{Key} \to \text{Hash Code} \to \text{Hash Index} }

Why?: Hashing lets us turn search keys into integer indices that can be accessed at O(1)
Hashing is a data-dependent problem, there’s no one “right” way to do it.

Hash Function: Takes a search key and produces the integer index of an element in the hash table.

Search key is mapped—or hashed—to the index.
The same key must have the same hash during execution.
Collision: When two or more keys have the same hash.

Remember: Override the Object class’s default hashcode when necessary!
The default hashcode function just returns an object’s memory address.
Example: The String and Integer class already implement their own hashcode function.

Real-world Example: CRC

Checking if a message has been modified by its CRC value.
- e.g., JPL uses CRC to determine whether messages sent by Pioneer are garbled.
Error-correction codes in file systems.

Hash Code and Hash Index

Typical Hashing:

Convert search key to integer (hash code)
Compress code into range of indices for hash table.

Hash Code v.s. Hash Index
Hash Code: The integer result of the hash function.
Hash Index: The hash code % Table Length
(This is what we use as the search key.)
aka: Compressing

Good Hash Functions

Minimize Collisions
Are fast to compute.

Note: Data Structure for Hash Tables
The only data structure compatible with hash tables are arrays, because random accesses are O(1)

Example: Ideal Hashing

Function add(key,value) {
  index=h(key);
  hashTable[index]=value;
}

Supposing that we have perfect hashing (no collision).

Computing Hash Codes

Example: Hashing a Unicode string

Naive Approach: Sum every character’s Unicode integer value.
Another Approach: Multiply each character’s integer value by its position in the String and sum the values. u_0 g^{n-1} + u_1 g^{n-2} + u^{n-2} g + u_{n-1}

int hash = 0;
int n = s.length();
for (int i = 0; i < n; i++) {
  hash = g * hash + s.charAt(i);
}

(The g constant was arrived at statistically)

Hash Code for Primitives:

Use the key itself if data type is int
Byte, short, char: Cast to int
Other primitives: Manipulate internal binary representations.

Compressing a Hash Code (Scaling Method)

After getting a hash code, we need to compress it into a usable hash index.

One way to do this is to mod the hash code by the length of the hash table.

\text{Scaling Method: Hash Code} \% n

Best to use an odd number for n
Prime number often gives good distribution of hash values.

private int getHashIndex(K key) {
    int hashIndex = key.hashCode() % hashTable.length;
    if (hashIndex < 0) {
        hashIndex = hashIndex + hashTable.length;
    }
    return hashIndex;
}

Note how we handle negative hash codes!

Important: Resizing a Hash Table
Because the hash index is dependent on the size of the hash table, resizing the hash table isn’t a trivial copy-by-value to be bigger array—you need to re-hash the values.

Handling Collisions

Reducing Collisions:

Improve the hash function
- e.g., distributing entries uniformly throughout the hash table.
  - (usually requires statistical analysis)
Increase table size.

Handling Collisions:

Use another location in the hash table
- (e.g., linear probing)
Change the structure of the hash tale so that each location can represent more than one value.
- (this introduces searching and a whole slew of problems like deleting entries)

Three Kinds of Locations in a Hash Table:

Occupied
- Location references an entry in the dictionary
Empty
- Location contains null and always has been.
Available
- Location was previously occupied, but is now available.

A. Linear Probing

Linear Probing: Resolves collision during hashing by examining consecutive locations in hash table (looping back around if we reach the end)

Basically just probing the next location until we find an empty location.
If we make it back to the starting location, we know that the table is full.
(We must differentiate between empty and available locations)

Note: This resolves collision during additions, but complicates removals.

Clustering: Collisions resolved with linear probing cause groups of consecutive locations in hash table to be occupied, called clusters.

Bigger clusters mean longer search times following collision.

B. Open Addressing with Quadratic Probing

Quadratic Probing: Considers the locations at indices k + j^j, rather than going consecutively from k.

e.g., probing k, k+1, k+4, etc.

C. Open Addressing with Double Hashing

Double Hashing: Use a second hash function to compute increments.

D. Separate Chaining

Separate Chaining: Track collisions with linked list.

Bucket: Each location is called a bucket, it can represent more than one value.

More on Implementation

Successful retrieval/removal has same efficiency as successful search.
Unsuccessful retrieval/removal has same efficiency as unsuccessful search.
Successful addition has same efficiency as unsuccessful search.
Unsuccessful addition has same efficiency as successful search.

Load Factor (\lambda)

Load Factor: Measure of cost of collision resolution

\lambda = \frac{\text{Num. of entries in dictionary}}{\text{Num. of locations in hash table}}

Never negative.
For open addressing, 1 \ge \lambda
For separate chaining, \lambda has no maximum.
Reducing \lambda improves performance.

Maintaining Performance:

< 0.5 for open addressing
< 1.0 for separate chaining
If load factor exceeds these bounds, increase the size of the hash table.

Rehashing:

Compute new size
1. Double present size
2. Increase result to next prime number
Use add to add current entries in dictionary to new table.

More Advanced Entries

Example: Another private inner-class to encapsulate key, value pairs along with state.

private static class TableEntry<K,V> {
  private K key;
  private V value;
  private States state;
  private enum States { CURRENT, REMOVED }
  private TableEntry(K key, V value) {
      this.key = key;
      this.value = value;
      this.state = States.CURRENT;
  }
}

Not necessary if we’re doing linear probing.

Java Class Library: The Class `HashMap`

Variety of constructors provided.
Default maximum load factor of 0.75
- Increases size when limit exceeded.
Possible to avoid rehashing by setting number of buckets larger.

Java Class Library: The Class `HashSet`

Implements java.util.Set interface
Uses an instance of HashMap